Optimal Bounds on Feasible DNA Computing
Abstract
A DNA program may be expressed as a sequence (not necessarily linear) of ligation, extraction and combination steps. As researchers in the field have recognized, the extraction steps are subject to significant misclassification errors. The error accumulates through successive extraction steps and may lead to completely wrong conclusions unless measures are taken to control it. Many researchers (Lipton, Adleman, Karp, Winfree and others) proposed to limit the misclassification error by repeated application of extraction. All of them assumed that the misclassification probability of a DNA strand is given (or at least bounded) irrespective of the concentration of good strands in the initial tube. In addition, previous researchers did not treat the probabilities of false negatives and false positives as separate parameters. Further, none of the previous models of error-resilient DNA computing is suitable for studying the effect of the encoding of the problem on error propagation. In this paper we propose an accurate analytical model of error propagation in the extraction step that considers the concentration of good strands and the probabilities of false positives and false negatives explicitly. In addition, the proposed model may potentially be used to study the effect of problem encoding on error resilience.

Introduction and Terminology.

DNA is nature's storage medium for genetic information. A DNA molecule is a double helix formed by two linear structures called the single strands (or sometimes just strands) of the DNA. Each strand may be viewed as a sequence of sugar molecules connected by covalent bonds at the 3' or 5' position. A nucleotide hangs from each of the sugar molecules. There are four types of nucleotides, viz., adenine, guanine, thymine and cytosine, typically denoted by the letters A, G, T and C respectively. Therefore, from the information storage viewpoint, a single strand of DNA is a sequence of the letters A, G, T and C. Given a DNA strand, the 3' position of the last sugar molecule at one end and the 5' position of the last sugar molecule at the other end are available for forming covalent bonds. This property gives the strand an orientation, called the 3'-5' orientation. Adenine molecules can form hydrogen bonds, or H-bonds (weaker than covalent bonds), with thymine molecules, and guanine molecules can form H-bonds with cytosine molecules. Hence adenine (A) is called the complement of thymine (T) and guanine (G) is called the complement of cytosine (C). This is called Watson-Crick complementation or pairing. Two H-bonds are formed between A and T and three H-bonds are formed between C and G. Therefore, C-G bonding is much stronger than A-T bonding and plays a vital role in the stability of in vitro DNA molecules. If two DNA strands are of the same length, oriented in opposite 3'-5' orientations and Watson-Crick complementary at each position, they are called sticky DNA. If two sticky strands come in contact, they form the famous double-helix structure so commonly found in popular science literature. For our purpose, though, we'll not attempt to draw the double helix, primarily because it is implied and largely unnecessary, but also for ease of comprehension and to save labor. Figure 1 illustrates the concepts described here. Note that the half-arrow on individual strands points toward the 3' end. Like the 0-1 bits in an electronic computer, the characters A, T, C and G in a DNA string are capable of expressing information.
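To make Watson-Crick pairing concrete, here is a minimal Python sketch (our illustration, not part of the paper; the function names are invented). It computes the partner that a given strand would stick to under the pairing rules just described:

    # Watson-Crick pairing: A<->T, C<->G.
    PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

    def reverse_complement(strand: str) -> str:
        # Complement every base and reverse, yielding the partner strand
        # read in its own 3'-5' orientation.
        return "".join(PAIR[base] for base in reversed(strand))

    def are_sticky(s1: str, s2: str) -> bool:
        # Two strands are sticky when they have the same length and are
        # oppositely oriented Watson-Crick complements of each other.
        return len(s1) == len(s2) and s2 == reverse_complement(s1)

    assert reverse_complement("ACGT") == "ACGT"  # ACGT is its own reverse complement
    assert are_sticky("GATTACA", reverse_complement("GATTACA"))

In this vocabulary, two sticky strands are exactly a string and its reverse complement.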
Each test tube of DNA molecules contains a huge amount of data. Adding a chemical to the test tube is equivalent to manipulating all data items using a single instruction, i.e., executing an instruction over a SIMD (single-instruction-multiple-data) parallel computer. In an electronic computer we manipulate information in a bit-by-bit (or byte-by-byte) fashion using Boolean logic operations such as AND, OR, NOT and XOR. The information stored in a DNA string cannot easily be separated into convenient pieces and may be manipulated only by chemical reactions, i.e., by adding reagents and by changing environmental parameters such as temperature. Therefore, the standard SIMD model is not very convenient for understanding DNA computation. We follow Adleman's model for describing DNA computation. Later, other researchers proposed other models of DNA computing [Reif2]; however, the chemistry and lab work involved remain largely unchanged. As per Adleman's description [Adle2], DNA computing begins with a suitable encoding of the entities involved in the problem description. The encoding uses a (possibly improper) subset of the DNA character set {A, T, G, C}. Once the encoding is done on paper, commercial laboratories may manufacture DNA molecules representing the encoded data. These artificial DNA molecules, built to specification, are called primers. The primers are mixed with water, salt, an enzyme called ligase and a few other ingredients. If the mixture is kept at the proper temperature, the primers in the solution join together to form larger and larger DNA molecules. This process is called ligation. It is to be noted that there are two types of ligation: sticky-end ligation and blunt-end ligation. Billions of DNA molecules in the solution ligate together, forming different larger strands of DNA. If the initial encoding is properly done, some of the new DNA strands (called 'good' strands) are expected to contain the solutions of the problem at hand while others (called 'bad' strands) will not. Ligation of smaller DNA strands is similar to the first part of the computation that takes place in a non-deterministic Turing machine (NDTM). An NDTM computes all possible strings of data that may be generated using the state transition rules and the input data. The second part of the NDTM computation involves choosing the correct solution out of all the solutions generated. In DNA computation a process called extraction does this. Usually we have a fairly good idea of what the good strands look like. Generally we know the length of the solution, which in turn tells us the length of the DNA strands we are looking for. Further, we'll normally know a subsequence characterization of the good strands, i.e., we'll know a set of subsequences using which only good strands may be isolated. For isolating strands of a particular length a process called gel electrophoresis is used, in which the DNA molecules are sorted by length using a slab of gel and an electric field. Extraction using a subsequence characterization is more complicated. To extract DNA strands with a particular subsequence, a special primer is designed whose sequence is complementary to the subsequence of interest. Iron balls coated with the primer are hung in the solution, a process called fishing. DNA strands containing the particular subsequence will form H-bonds with the primer molecules on the iron balls. The iron balls are eventually removed and washed in salt solutions to separate the DNA strands.
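As an idealized picture of the extract primitive just described (a sketch of ours, ignoring errors for now; the names and toy strands are made up), a tube can be modeled as a multiset of strings split on the presence of a characterizing subsequence:

    from collections import Counter

    def extract(tube: Counter, subseq: str):
        # Idealized, error-free extraction: split a tube into a 'yes' tube
        # of strands containing subseq and a 'no' tube of the rest.
        yes, no = Counter(), Counter()
        for strand, count in tube.items():
            (yes if subseq in strand else no)[strand] = count
        return yes, no

    # A toy tube after ligation; counts stand for numbers of molecules.
    tube = Counter({"ATTGCCGA": 5, "GGATCCAT": 3, "TTGCATTA": 2})
    yes_tube, no_tube = extract(tube, "TTGC")
    # yes_tube holds ATTGCCGA and TTGCATTA; no_tube holds GGATCCAT.

The rest of the paper is concerned precisely with the fact that the real laboratory step only approximates this clean partition.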
A few other terms in this context are restriction enzymes (enzymes that cut a DNA strand at a position defined by a particular subsequence), annealing (the process of joining sticky strands to form the double helix or other double-stranded DNA molecules), denaturing (converting double-stranded DNA molecules to single-stranded DNA), combining (mixing the contents of several test tubes) and detect (detecting the presence of good strands in a test tube and finding their sequence). Obviously we lose good DNA strands during extraction due to misclassification as well as due to transfer loss. A mechanism called the Polymerase Chain Reaction (PCR) is used to generate new good strands in the test tube. PCR works by a series of denaturing, polymerization and annealing steps. In this paper we do not consider the effect of PCR; we plan to analyze it in subsequent research.

Problem Description.

A typical DNA computation may be described as a directed acyclic graph (DAG). Each node of the DAG is a test tube (or tube for brevity). The DAG has exactly one node with in-degree zero. This node (the source of the DAG) is called the initial tube. The initial tube is the tube containing the mixture of DNA strands right after the ligation step. There may be several tubes with out-degree zero; however, exactly one of them is marked as the final or solution tube. The goal of the computation is to obtain a solution tube with a particular volume of DNA strands and a particular concentration of good or solution strands. Every non-sink tube in the DAG has out-degree exactly two, denoting an extraction step. However, the extraction step need not be a simple extraction; it may also be a compound or repeated extraction. A tube with in-degree greater than one denotes a combination step. If there is no directed path from a tube to the final tube, then that tube is not useful and may be marked a waste tube. For the purpose of this research we assume that the volume (total number of DNA strands) and concentration (proportion of good strands to the volume) in the initial tube are known. Further, we assume that the volume and concentration requirements of the final tube are also known. We'll show later that the volume and concentration in each tube must maintain a certain relationship. We call a DNA computation feasible if, given the initial volume and concentration, it is possible to achieve the required volume and concentration in the final tube while maintaining the volume-concentration relationship in each intermediate tube. Generally it is assumed that the good strands may be enhanced arbitrarily by arbitrarily many applications of the PCR step. However, this assumption is not entirely correct. In addition to the solution strands, PCR may amplify other strands and may generate many unnecessary artifacts in a tube. We plan to analyze the effect of PCR in a subsequent paper; for the purpose of our present work we do not consider the PCR step at all. Earlier researchers modeled the feasibility problem as follows. They assumed that during the extraction step each good strand moves to the correct tube with a fixed probability p. Hence it is possible to obtain a very-high-volume, very-high-concentration 'yes' tube by repeating the simple extraction step a few times. They assumed that the extraction step is standard and performs identically regardless of volume, concentration and other parameters.
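The DAG view of a computation lends itself to a direct data-structure encoding. The sketch below is ours (class and function names are invented for illustration); it records, for each tube, its volume, its concentration of good strands, and its parent tubes, which is exactly the information a feasibility check would propagate through the DAG:

    from dataclasses import dataclass, field

    @dataclass
    class Tube:
        # A node of the computation DAG: a test tube with its volume
        # (total number of strands) and concentration (fraction of good strands).
        name: str
        volume: float          # total number of strands (e.g., in moles)
        concentration: float   # good strands / volume, dimensionless
        parents: list["Tube"] = field(default_factory=list)

    def combine(name: str, *tubes: Tube) -> Tube:
        # Combination step: volumes add and good-strand counts add,
        # so the concentration is the volume-weighted average.
        volume = sum(t.volume for t in tubes)
        good = sum(t.volume * t.concentration for t in tubes)
        return Tube(name, volume, good / volume, parents=list(tubes))

    initial = Tube("initial", volume=1.0, concentration=0.01)
    # An extraction step would give 'initial' a 'yes' child and a 'no' child;
    # how the split behaves depends on the error model developed below.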
Related Works and Their Limitations.

From the preceding discussion it is clear that a DNA computation algorithm may be described as a sequence of extract, combine and detect primitives. PCR steps may be introduced at any place in the algorithm, and it is believed that they can only be beneficial, by increasing the number of good strands. Winfree [Winfree1] called the PCR step the duplicate step. He formalized the compound extraction step by representing it as a single-source, two-sink directed acyclic graph (DAG). (It is to be noted that earlier we modeled the whole computation by means of a DAG.) Here we present a slightly enhanced form of Winfree's model. The source of the DAG is the test tube containing good and bad strands, whereas the sinks are two test tubes, one containing good strands and the other containing bad strands. All other nodes represent the extract or duplicate operation. If a node has two incoming edges, there is an implied combine operation at that node. For each extract node there are two outgoing edges labeled σ and σ'. S(σ) is the subsequence used to extract DNA at this node. The edge labeled σ refers to the test tube with DNA strands containing S(σ), and the edge labeled σ' refers to the test tube with strands not containing S(σ). Duplicate nodes have only one outgoing edge. Clearly, each extraction step introduces both false positive and false negative errors. In other words, some of the bad strands will be classified as good strands (false positives, or type I errors) and some of the good strands will be classified as bad strands (false negatives, or type II errors). The errors accumulate through a large number of steps and finally may lead to a wrong result. While it is important to invent new extraction technologies that reduce errors in strand classification, computer scientists have made several attempts to make the existing methods more error-resilient through combinatorial methods for repeated extraction and better problem design. The computing community has primarily concentrated on variations of three methods of error reduction. Two of those were proposed by Leonard Adleman [Adle2] and the third was proposed by Lipton and others [Lipton2]. Digestion-based techniques are primarily applicable to reduced-volume computation. As we are interested in the general DNA computing paradigm, we'll not discuss digestion-based methods here. Adleman's first method involves repeated application of the extraction step on the 'yes' tube, i.e., the tube holding DNA strands containing the subsequence S(σ). If the false negative rate is high (i.e., the probability of misclassifying a good strand as a bad one is significant), this method will result in a significant loss of good strands. Karp and others [Karp1] proposed a method to improve Adleman's method of repeated extraction. They proposed to simulate a reliable extraction method by multiple applications of the existing faulty extraction method; this is called compound extraction. Karp claimed that if δ is the desired error probability of the extract method and ε is the error probability of the available method, then error rate δ may be achieved by $O(\log_\varepsilon^2 \delta)$ applications of the available method in $O(\log_\varepsilon \delta)$ parallel steps. Chen and Winfree [Winfree1] formalized Karp's method and proved that the constants involved are not large. We discuss Winfree's method here, as it is the culminating point of this approach.
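As a quick back-of-the-envelope reading of Karp's bound (our arithmetic, with illustrative numbers and constants ignored), the number of parallel stages scales as $\log_\varepsilon \delta = \ln\delta/\ln\varepsilon$:

    import math

    def karp_stages(eps: float, delta: float) -> int:
        # Number of parallel stages, log_eps(delta), up to constant factors.
        return math.ceil(math.log(delta) / math.log(eps))

    eps, delta = 0.1, 1e-6
    stages = karp_stages(eps, delta)   # ~6 parallel stages
    operations = stages ** 2           # ~36 extract applications, up to constants

So driving a 10% extraction error down to one in a million costs only a handful of stages under this model; the catch, as we argue below, is the assumption that ε stays fixed from tube to tube.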
Figure 2 explains Winfree's method by providing an example. Suppose the DNA strands are to be extracted using the subsequence s. The source node of the DAG represents the initial test tube. The extraction at the first level produces two tubes. The extraction method is applied to each tube, and thus three tubes are obtained; the middle tube is a combination of a yes tube and a no tube, coming from the left and from the right respectively. The extraction steps may be repeated arbitrarily many times. It is to be noted that Winfree assumes that the error rates of the extraction method remain largely the same from one layer to another and between test tubes in a single layer, an assumption we wish to contradict later. The second approach proposed by Adleman involves using PCR to amplify the number of good strands in a test tube. If good strands are lost between steps, this mechanism may be used to replenish their supply. Boneh and others [Lipton2] proposed and analyzed the use of PCR to reduce the error rate in the case of decreasing-volume computation (where the bad strands are discarded as soon as they are discovered). However, many important problems, like formula-SAT and breaking DES, are not decreasing-volume computations. Again, Chen and Winfree extended Boneh's method to the general bio-molecular computation paradigm. They assumed that all the strands are to be retained and may be used in subsequent steps. Using the assumption that PCR will amplify the number of good strands without significantly amplifying the number of bad strands, they gave methods to compute the frequency of PCR applications for a desired success probability. It is to be noted that the performance of PCR depends on the structure of the bad strands and their concentration. Hence Chen and Winfree's method needs to be modified to give better bounds. In fact, in some cases the application of PCR may not be purely beneficial and hence may not be advisable at a high frequency. The third approach (proposed by Lipton and others [Lipton2]) attempts to improve error resilience by choosing a good encoding of the data. In particular, they propose to use an encoding such that each strand contains either S(σ) or S(σ') but not both. This approach is promising, but the results were not developed sufficiently. Among other things, it increases the size of each strand and therefore limits the size of the problems that may be handled with DNA strings of a given length. Further, if the concentration of bad strands is high and there is a long common subsequence between S(σ) and some other part of the bad strands (or vice versa), then this simple coding method may not be very helpful. Attempting to avoid long common subsequences will increase the problem/solution size even further and should not be attempted without a clear idea of the trade-offs.
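To summarize how the surveyed analyses behave, here is a small sketch of ours that tracks expected strand counts under Adleman-style repeated extraction of the 'yes' tube, assuming (as Winfree does, and as we question below) fixed per-pass false-negative and false-positive rates; all numbers are illustrative:

    def repeated_extraction(good: float, bad: float, fn: float, fp: float, passes: int):
        # Expected counts after repeatedly re-extracting the 'yes' tube with
        # fixed false-negative (fn) and false-positive (fp) rates per pass.
        for _ in range(passes):
            good *= 1.0 - fn   # good strands retained in the 'yes' tube
            bad *= fp          # bad strands leaking into the 'yes' tube
        return good, bad

    good, bad = repeated_extraction(good=1e6, bad=1e8, fn=0.1, fp=0.2, passes=5)
    purity = good / (good + bad)   # rises from ~1% to ~95% in five passes
    # The price: a fraction 1 - (1 - fn)**passes (~41% here) of the good
    # strands is lost, which is exactly the drawback noted above for
    # Adleman's first method.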
Analysis of Simple Extraction Step.

A close look at extraction by fishing will convince one that the assumption of a constant (or bounded) misclassification probability for a good DNA strand is not a good assumption. To fish good strands out of a tube, one dips iron balls coated with suitable primer molecules into the tube. The primer molecules, if suitably designed, anneal with the good strands and come out with the iron balls. If only good strands could anneal with the primer molecules, there would be no problem and the simple extraction mechanism would work perfectly. However, annealing is a blind mechanism: a primer molecule will bind with any strand having a properly oriented and sufficiently long complementary subsequence. Therefore, in practice, bad strands may also anneal with primer molecules, though the probability of annealing with a bad strand is expected to be lower than the probability of annealing with a good strand. Fishing a molecule out of a test tube thus involves two steps: in the first step the primer molecule needs to meet the intended molecule, and in the second step the primer needs to anneal with it. It is to be noted that meeting a molecule with an appropriate complementary subsequence does not mean that annealing will always occur; meeting of the primer and the molecule does not guarantee meeting of the complementary sections of the two. Therefore, we model the situation by assigning a probability of annealing given that the primer and the molecule have met. In particular, we use two different probabilities: the probability of annealing to a good molecule after meeting, denoted by p, and the probability of annealing to a bad molecule after meeting, denoted by p'. Typically the value of p will be 0.9 or greater and the value of p' will be 0.2 or smaller. The actual values of p and p' depend on several factors, such as the structures of the good strands, bad strands and primers, the temperature, and other physical and chemical parameters. A good molecule fails to anneal to a primer molecule after meeting with probability q = 1 - p; further, we define q' = 1 - p'. When a primer anneals to a bad strand, we call it a false positive or type I error. When a primer fails to anneal to a good strand, we call it a false negative or type II error. In our model, p' is the probability of a type I error and q is the probability of a type II error. A simple extraction step begins with an initial test tube containing a mixture of good and bad strands. The good strands are characterized by some subsequence present in all good strands and absent in all bad strands. Good strands are fished out of the initial tube and stored in a tube called the 'yes' tube. Once the fishing is over, the remaining tube is supposed to contain mostly bad strands and is called the 'no' tube. Let $V_I$ denote the total number of strands in the initial tube and $R_I$ the ratio of the number of good strands to $V_I$ in the initial tube. $V_I$ is expressed in moles, and $R_I$, being a ratio of numbers of molecules, is unit-free. Further, $V_Y$, $R_Y$, $V_N$ and $R_N$ are the total numbers of strands and the ratios of good strands in the yes and no tubes respectively. At this point we consider a thought experiment. Let us try to fish exactly one more molecule from the no tube into the yes tube. We dip exactly one molecule of primer into the no tube and leave it there until it anneals with a strand; whatever strand it anneals with is transferred to the yes tube. Let $P_G$ denote the probability that we indeed transfer a good strand to the yes tube. The primer may meet a good molecule in its first chance and anneal with it; in that case a good strand is transferred to the yes tube. Otherwise, the primer may not anneal to anything in its first meeting but then meet a good molecule in its second meeting and anneal to it; then also a good strand is transferred to the yes tube. In general, to transfer a good strand to the yes tube, the primer may fail to anneal to anything in meetings 1, ..., i-1 and then anneal to a good molecule at the i-th meeting, for i = 1, ..., ∞. The probability of meeting a good molecule is $R_N$, assuming that the good strands are randomly distributed in the tube. The probability of not annealing with any strand at a particular meeting is $R_N q + (1-R_N)q'$. Therefore,

$$P_G = R_N p + \big(R_N q + (1-R_N)q'\big)\,R_N p + \big(R_N q + (1-R_N)q'\big)^2\,R_N p + \cdots = \frac{R_N p}{p' + R_N (p - p')}.$$
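The geometric series above can be sanity-checked with a tiny Monte Carlo simulation (ours; the parameter values are illustrative only):

    import random

    def fish_once(R_N: float, p: float, p_prime: float) -> bool:
        # One primer fishing in the no tube: meet random strands until one
        # anneals; return True iff the transferred strand is a good one.
        while True:
            good = random.random() < R_N
            if random.random() < (p if good else p_prime):
                return good

    def estimate_PG(R_N=0.3, p=0.9, p_prime=0.2, trials=100_000) -> float:
        return sum(fish_once(R_N, p, p_prime) for _ in range(trials)) / trials

    # Closed form: R_N p / (p' + R_N (p - p')) = 0.27 / 0.41 ≈ 0.659;
    # the estimate printed below converges to the same value.
    print(estimate_PG())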
At this point we may entertain three possibilities: $P_G > R_Y$, $P_G = R_Y$ and $P_G < R_Y$. If $P_G > R_Y$, transferring the newly fished strand to the yes tube will increase $R_Y$ and should be done. Similarly, if $P_G < R_Y$, the newly fished strand should not be transferred to the yes tube. Not only that: if there is a very large number of strands in the solution (i.e., if we may treat the parameters as continuous), then $P_G < R_Y$ implies that we should not have fished the previous strand either. Therefore $P_G = R_Y$ gives a local optimum, and any well-designed simple extraction step must obey this equation. Substituting the expression for $P_G$ into the equation, we get

$$R_Y = \frac{R_N p}{p' + R_N (p - p')}.$$

In addition, note that the total number of good strands in the initial tube is the same as the sum of the numbers of good strands in the yes tube and the no tube, and the same relation holds for the bad strands, so $V_I R_I = V_Y R_Y + V_N R_N$ and $V_I (1-R_I) = V_Y (1-R_Y) + V_N (1-R_N)$. By writing these expressions for the preservation of strands and by algebraic manipulation we get the relationship that the volumes and concentrations of the yes and no tubes must maintain.
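A minimal numeric sketch (ours, with made-up parameter values) of how these relations pin down a simple extraction step: the equilibrium condition fixes $R_Y$ from $R_N$, and strand preservation then fixes the volumes:

    def equilibrium_RY(R_N: float, p: float, p_prime: float) -> float:
        # Yes-tube concentration at the local optimum P_G = R_Y.
        return (R_N * p) / (p_prime + R_N * (p - p_prime))

    p, p_prime = 0.9, 0.2        # annealing probabilities as suggested in the text
    V_I, R_I = 1.0, 0.10         # initial tube: volume and concentration
    R_N = 0.05                   # suppose fishing leaves the no tube 5% good

    R_Y = equilibrium_RY(R_N, p, p_prime)     # ~0.191
    # Preservation of strands: V_I = V_Y + V_N and
    # V_I*R_I = V_Y*R_Y + V_N*R_N determine the volumes.
    V_Y = V_I * (R_I - R_N) / (R_Y - R_N)     # ~0.353
    V_N = V_I - V_Y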